Introduction

Every day we produce and encounter huge amounts of text data, spoken or written, in many languages. However, the only language computers understand is numbers. To work with this data efficiently, we need to train computers to understand spoken and written words, which can be achieved through Natural Language Processing (NLP). NLP gives computers the ability to understand written text and spoken words in much the same way human beings can. It enables computers to process human language in the form of text or voice data and to ‘understand’ its full meaning, including the speaker’s or writer’s intent and sentiment.

Aims

  1. To understand the basic relationships observed in the given data in order to interpret it more efficiently.
  2. To explore each data file, en_US.blogs, en_US.news and en_US.twitter, for file size and number of characters, words and lines.
  3. To take samples of each file and combine them for further analysis.
  4. To examine the word distribution in each file using tables, histograms and word clouds.
  5. To build \(N-grams\) of words to show relationships between words in the sample data set.
  6. To summarize the word distributions and \(N-gram\) relationships with histograms and word clouds.
  7. To build a backoff predictive model for next-word prediction.

Useful Packages

library(tidytext, warn.conflicts = FALSE)
library(tidyverse, warn.conflicts = FALSE)
library(stringi, warn.conflicts = FALSE)
library(plotly, warn.conflicts = FALSE)
library(qdapRegex, warn.conflicts = FALSE)
library(wordcloud, warn.conflicts = FALSE)
library(RColorBrewer, warn.conflicts = FALSE)
library(syuzhet, warn.conflicts = FALSE)
library(SentimentAnalysis, warn.conflicts = FALSE)
library(sentimentr, warn.conflicts = FALSE)
library(data.table, warn.conflicts = FALSE)

Basic Data Exploration

Data Import

The three data files will be imported and samples taken for further analysis. We will remove profane words by filtering the data against the words in the profanity_txt file.

  1. en_US.blogs
  2. en_US.news
  3. en_US.twitter
  4. Profane words
setwd("C:/Users/justi/Documents/Olu_Drive/Coursera/Data_Science_Statistics_and_Machine_Learning_Specialization/Capstone/en_US")
blogs_txt <- readLines("en_US.blogs.txt", warn = FALSE, encoding = "UTF-8", skipNul = TRUE)
news_txt <- readLines("en_US.news.txt", warn = FALSE, encoding = "UTF-8", skipNul = TRUE)
twitter_txt <- readLines("en_US.twitter.txt", warn = FALSE, encoding = "UTF-8", skipNul = TRUE)
profanity_txt <- readLines("profanity.txt", warn = FALSE, encoding = "UTF-8", skipNul = TRUE)
profanity_df <- tibble(profanity_txt)
special_txt <- readLines("special.txt", warn = FALSE, encoding = "UTF-8", skipNul = TRUE)
special_df <- tibble(special_txt)
stopWords_txt <- readLines("stopwords.txt", warn = FALSE, encoding = "UTF-8", skipNul = TRUE)
stopWords_df <- tibble(stopWords_txt)

Brief Data Summary

Here we determine the basic features of the data sets en_US.blogs, en_US.news and en_US.twitter. The table below shows the file size (in MB), number of characters (letters, spaces and others), number of words and number of lines for each file.

File Type   File Size (MB)   Characters    Words       Lines
Blogs       200.42           206824505     37546250    899288
News        196.28           15639408      2674536     77259
Twitter     159.36           162096241     30093413    2360148
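
The per-file figures above can be reproduced with base R alone (file.size() for bytes, nchar() and strsplit() for characters and words); the toy vector below is an illustrative stand-in for one of the corpora.

```r
# Toy stand-in for one of the corpora (e.g. blogs_txt)
lines <- c("hello world", "foo bar baz")

n_lines <- length(lines)                          # number of lines
n_chars <- sum(nchar(lines))                      # characters, including spaces
n_words <- sum(lengths(strsplit(lines, "\\s+")))  # whitespace-delimited words

c(lines = n_lines, characters = n_chars, words = n_words)
```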

Sampling

We will sample 0.5% of each data set (blogs_txt, news_txt and twitter_txt) and combine the samples into a single set, sample1_txt. See below the first 3 lines of sample1_txt.

[1] "Or put another way – in the spirit of this site’s mission – it’s all bollocks."
[2] "No Regrets for Our Youth – 0"                                                  
[3] "Tom: See you!"                                                                 
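
The 0.5% sampling itself can be sketched with rbinom(), drawing a Bernoulli keep/drop flag per line; the corpus vector and seed below are illustrative, not the actual data.

```r
set.seed(123)  # illustrative seed
sample_frac <- 0.005
# Toy corpus standing in for c(blogs_txt, news_txt, twitter_txt)
corpus <- rep("some line of text", 10000)
# Keep each line independently with probability 0.005
keep <- rbinom(length(corpus), size = 1, prob = sample_frac) == 1
sample1_txt <- corpus[keep]
length(sample1_txt)  # roughly 0.5% of 10000 lines
```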

Data Preprocessing

First, we need to clean the data and remove irrelevant characters so we can concentrate on the important words from this file.

  1. Remove lines containing characters that cannot be converted from latin1 to ASCII.
latin1ASII_func <- grep("latin1ASII", iconv(sample1_txt, "latin1", "ASCII", sub="latin1ASII"))
sample2_txt <- sample1_txt[-latin1ASII_func]
  2. Remove special characters, digits and extra white space.

sample3_txt <- gsub("&amp", " ", sample2_txt)
sample3_txt <- gsub("RT :|@[a-z,A-Z]*: ", " ", sample3_txt) # remove tweets
sample3_txt <- gsub("@\\w+", " ", sample3_txt)
sample3_txt <- gsub("[[:digit:]]", " ", sample3_txt) # remove digits
sample3_txt <- gsub(" #\\S*"," ", sample3_txt)  # remove hash tags 
sample3_txt <- gsub(" ?(f|ht)tp(s?)://(.*)[.][a-z]+", " ", sample3_txt) # remove url
sample3_txt <- gsub("[^[:alnum:][:space:]']", "", sample3_txt) # Remove punctuation except apostrophes
sample3_txt <- rm_white(sample3_txt) # remove extra spaces using `qdapRegex` package

See below the first 3 lines of the clean set.

[1] "Tom See you"                                                                                                                                           
[2] "See it's all the fault of evolution"                                                                                                                   
[3] "But seriously Wells Youngs WHAT IS THIS BULL CRAP ABOUT NOT SELLING IT IN THE UK UNTIL NEXT YEAR Get it sorted I want to be drinking this at Christmas"

Tokenization

A token is a meaningful unit of text, most often a word, that we are interested in using for further analysis, and tokenization is the process of splitting text into tokens.
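
Stripped of the tidytext machinery used below, tokenization can be illustrated in base R by lower-casing the text and splitting on whitespace; the sentence is taken from the cleaned sample above.

```r
text <- "See it's all the fault of evolution"
# Split on runs of whitespace, then normalize case
tokens <- tolower(unlist(strsplit(text, "\\s+")))
tokens
```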

Create one-token-per-document-per-row

We need to both break the text into individual tokens and transform it to a tidy data structure. This is equivalent to a unigram, or \(1-gram\). We also need to filter profane words out of the text corpus.

sample_df2 <- sample_df %>%
  unnest_tokens(word, text) %>% 
  filter(!word %in% profanity_df$profanity_txt) %>% # Remove profane words
  filter(!word %in% special_df$special_txt) %>% # Remove special words
  drop_na()

Find most frequent words

The count() function will be useful here and will help us visualize the data set. See below the five most frequent words and their frequencies \(n\).

unigram <- sample_df2 %>% 
  count(word, sort = TRUE) %>% 
  mutate(word = reorder(word, n)) %>% 
  filter(n > 10)
head(unigram, 5)
# A tibble: 5 x 2
  word      n
  <fct> <int>
1 the   10521
2 to     7141
3 i      5938
4 a      5843
5 and    5626

Data Visualization

Create histogram

We use ggplot to generate the histogram and line graph below, showing words that occur more than 800 times.
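
For readers without the hidden plotting chunk, a minimal base-R equivalent is sketched below; the counts are the top unigram frequencies reported above, and the 800 cutoff matches the text.

```r
# Top unigram counts from the table above (illustrative subset)
counts <- c(the = 10521, to = 7141, i = 5938, a = 5843, and = 5626)
frequent <- counts[counts > 800]
# Horizontal bar chart of the frequent words
barplot(sort(frequent), horiz = TRUE, las = 1,
        xlab = "Frequency", main = "Words occurring more than 800 times")
```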

Line graph

Word cloud

Sentiment and Emotion Analysis

Sentiment analysis is used to systematically identify, extract, quantify and study affective states and subjective information in text data. It helps us understand the social sentiment in the data. Emotion analysis, in turn, identifies and analyzes the underlying emotions expressed in the data, such as good or bad, sad or happy.

Sentiment Analysis

The pie chart shows that most of the sentiments expressed in the sample text are positive.

sentiment_txt <- sample_df2 %>% 
  filter(!word %in% profanity_df$profanity_txt) %>%
  filter(!word %in% special_df$special_txt) %>%
  anti_join(stop_words)

sentiment_txt <- sentiment_txt$word
  
  
sentiment_df <- analyzeSentiment(sentiment_txt)

# Save data to r object
saveRDS(sentiment_df, "sentiment_df.rds")

# Extract dictionary-based sentiment according to the QDAP dictionary
SentimentQDAP_df <- sentiment_df$SentimentQDAP

# View sentiment direction (i.e. positive, neutral and negative)
sentimentDirection_char <- convertToDirection(SentimentQDAP_df)
sentimentDirection_df <- data.frame("SentimentDirection" = sentimentDirection_char)

# Combine sentiment direction with SentimentQDAP in a data set
sentimentDirection_df$SentimentQDAP <- sentiment_df$SentimentQDAP

# Draw a pie chart
sentimentDirection_df %>% 
  drop_na() %>% 
  ggplot(., aes(x = "", y = SentimentDirection, fill = SentimentDirection)) +
  geom_bar(width = 1, stat = "identity") +
  coord_polar("y", start = 0) +
  theme_void()

Emotion Analysis

The histogram below reveals the emotions expressed in the data set.

emotion_txt <- sample_df2 %>% 
  filter(!word %in% profanity_df$profanity_txt) %>%
  filter(!word %in% special_df$special_txt) %>%
  anti_join(stop_words)

emotion_txt <- emotion_txt$word

emotion_df <- setDF(emotion_by(get_sentences(emotion_txt)))

# Save data to r object
saveRDS(emotion_df, "emotion_df.rds")

emotion_df$emotionType <- as.character(emotion_df$emotion_type)
emotion_df2 <- emotion_df %>% 
  select(!emotion_type) %>% 
  filter(!emotionType %in% c("anticipation_negated", "fear_negated",
                             "surprise_negated", "disgust_negated",
                             "sadness_negated", "joy_negated", "trust_negated",
                             "anger_negated"))
# Histogram
ggplot(emotion_df2) +
  aes(reorder(emotionType, emotion_count), emotion_count, fill = emotionType) +
  geom_col() + # bar heights taken directly from emotion_count
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  coord_flip() +
  labs(x = "Emotion", y = "Frequency")

N-Grams Analysis

Essentially, the step under Tokenization above is equivalent to a \(1-gram\). From here, we visualize the data in the form of \(2-gram\), \(3-gram\) and \(4-gram\) tokens.
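
The n-gram construction that unnest_tokens() performs below can be sketched in base R as a sliding window over the word vector; the phrase used here is illustrative.

```r
# Build all n-grams of a word vector by sliding a window of width n
ngrams <- function(words, n) {
  vapply(seq_len(length(words) - n + 1),
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

words <- c("thanks", "for", "the", "follow")
ngrams(words, 2)
```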

Bigram

Creating Bigram

Generate \(2-gram\) token and remove profane words.

bigram <- sample_df %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>% 
  separate(bigram, c("word1", "word2"), sep = " ", 
           extra = "drop", fill = "right") %>%
  filter(!word1 %in% profanity_df$profanity_txt,
         !word2 %in% profanity_df$profanity_txt,
         !word1 %in% special_df$special_txt,
         !word2 %in% special_df$special_txt) %>% # Remove profane words
  drop_na() %>% 
  unite(bigram, word1, word2, sep = " ")

Find Most Frequent Words in the Bigram

See below the five most frequent bigrams and their frequencies \(n\).

# A tibble: 5 x 2
  bigram      n
  <fct>   <int>
1 in the    868
2 of the    854
3 for the   597
4 on the    477
5 to the    466

Bigram visualization

The histogram and line graph show \(2-grams\) occurring more than 200 times.

Histogram

Line Graph for Bigram

Word cloud for Bigram

Trigram

Creating Trigram

Generate \(3-gram\) token and remove profane words

trigram <- sample_df %>%
  unnest_tokens(trigram, text, token = "ngrams", n = 3) %>% 
  separate(trigram, c("word1", "word2", "word3"), sep = " ", 
           extra = "drop", fill = "right") %>%
  filter(!word1 %in% profanity_df$profanity_txt,
         !word2 %in% profanity_df$profanity_txt,
         !word3 %in% profanity_df$profanity_txt,
         !word1 %in% special_df$special_txt,
         !word2 %in% special_df$special_txt,
         !word3 %in% special_df$special_txt) %>% # Remove profane words
  drop_na() %>% 
  unite(trigram, word1, word2, word3, sep = " ")

Find Most Frequent Words in the Trigram

See below the five most frequent trigrams and their frequencies \(n\).

# A tibble: 5 x 2
  trigram                n
  <fct>              <int>
1 thanks for the       127
2 one of the            64
3 a lot of              62
4 i want to             58
5 looking forward to    57

Trigram visualization

The histogram and line graph show \(3-grams\) occurring more than 40 times.

Histogram

Line Graph for Trigram

Word cloud for Trigram

Quadgram

Creating Quadgram

Generate \(4-gram\) token and remove profane words.

quadgram <- sample_df %>%
  unnest_tokens(quadgram, text, token = "ngrams", n = 4) %>% 
  separate(quadgram, c("word1", "word2", "word3", "word4"), sep = " ", 
           extra = "drop", fill = "right") %>%
  filter(!word1 %in% profanity_df$profanity_txt,
         !word2 %in% profanity_df$profanity_txt,
         !word3 %in% profanity_df$profanity_txt,
         !word4 %in% profanity_df$profanity_txt,
         !word1 %in% special_df$special_txt,
         !word2 %in% special_df$special_txt,
         !word3 %in% special_df$special_txt,
         !word4 %in% special_df$special_txt) %>% # Remove profane words
  drop_na() %>% 
  unite(quadgram, word1, word2, word3, word4, sep = " ")

Find Most Frequent Words in the Quadgram

See below the five most frequent quadgrams and their frequencies \(n\).

# A tibble: 5 x 2
  quadgram                  n
  <fct>                 <int>
1 thanks for the follow    35
2 thank you for the        20
3 the end of the           20
4 at the end of            18
5 for the first time       17

Quadgram visualization

The histogram and line graph show \(4-grams\) occurring more than 10 times.

Histogram

Line Graph for Quadgram

Word cloud for Quadgram

Quintgram

Creating Quintgram

Generate \(5-gram\) token and remove profane words.

quintgram <- sample_df %>%
  unnest_tokens(quintgram, text, token = "ngrams", n = 5) %>% 
  separate(quintgram, c("word1", "word2", "word3", "word4", "word5"), sep = " ", 
           extra = "drop", fill = "right") %>%
  filter(!word1 %in% profanity_df$profanity_txt,
         !word2 %in% profanity_df$profanity_txt,
         !word3 %in% profanity_df$profanity_txt,
         !word4 %in% profanity_df$profanity_txt,
         !word5 %in% profanity_df$profanity_txt,
         !word1 %in% special_df$special_txt,
         !word2 %in% special_df$special_txt,
         !word3 %in% special_df$special_txt,
         !word4 %in% special_df$special_txt,
         !word5 %in% special_df$special_txt) %>% # Remove profane words
  drop_na() %>% 
  unite(quintgram, word1, word2, word3, word4, word5, sep = " ")

Find Most Frequent Words in the Quintgram

See below the five most frequent quintgrams and their frequencies \(n\).

# A tibble: 5 x 2
  quintgram                            n
  <fct>                            <int>
1 the santelena hotel venice italy    11
2 at pates and fountain parks         10
3 at the end of the                    9
4 classic at pates and fountain        8
5 in the middle of the                 7

Histogram

Modeling and Prediction

We will use \(2-grams\), \(3-grams\) and \(4-grams\) to build the required model for next-word prediction.

Building the n-grams database

Split each \(N-gram\) into its constituent words and store them back in the same data frame.
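
The split-and-store step amounts to breaking each n-gram string on spaces, as tidyr::separate() does in the pipelines above; a base-R sketch on one illustrative bigram:

```r
bigram <- "good morning"            # illustrative bigram
parts <- strsplit(bigram, " ")[[1]] # split into constituent words
word1 <- parts[1]
word2 <- parts[2]
c(word1, word2)
```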

Simple prediction model using \(2-gram\).

Find the next word for “good”.

Filter the data frame where the first word is “good” to find the possible next words.

[1] "morning" "to"      "luck"    "day"    

The possible next words are shown above. We will build a more reliable model going forward.
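
Under the hood this lookup is just a filter on word1 over the bigram table; the bigram_counts table and its counts below are hypothetical, ordered so the most frequent continuation comes first.

```r
# Hypothetical bigram counts for illustration only
bigram_counts <- data.frame(
  word1 = c("good", "good", "good", "good", "bad"),
  word2 = c("morning", "to", "luck", "day", "idea"),
  n     = c(40, 32, 25, 20, 7)
)
# Possible continuations of "good", most frequent first
next_words <- bigram_counts$word2[bigram_counts$word1 == "good"]
next_words
```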

Building prediction model

First, we create a matching function for each \(N-gram\).

# Bigram matching
bigram_func <- function(inputWords){
  num <- length(inputWords)
  # Number of rows to be selected
  nRow <- 1L
  filter(bigram_df, word1 == inputWords[num]) %>%
    add_count(word2, sort = TRUE) %>%
    top_n(3, n) %>% 
    filter(row_number() == nRow) %>%
    select(num_range("word", 2)) %>%
    as.character() -> out
  # Fall back to "?" when no match is found
  if (out == "character(0)") "?" else out
}

# Trigram matching
trigram_func <- function(inputWords){
  num <- length(inputWords)
  # Number of rows to be selected
  nRow <- 1L
  filter(trigram_df,
         word1 == inputWords[num - 1],
         word2 == inputWords[num]) %>%
    add_count(word3, sort = TRUE) %>%
    top_n(3, n) %>%
    filter(row_number() == nRow) %>%
    select(num_range("word", 3)) %>%
    as.character() -> out
  # Back off to the bigram match when no trigram is found
  if (out == "character(0)") bigram_func(inputWords) else out
}

# Quadgram matching
quadgram_func <- function(inputWords){
  num <- length(inputWords)
  # Number of rows to be selected
  nRow <- 1L
  filter(quadgram_df,
         word1 == inputWords[num - 2],
         word2 == inputWords[num - 1],
         word3 == inputWords[num]) %>%
    add_count(word4, sort = TRUE) %>%
    top_n(3, n) %>%
    filter(row_number() == nRow) %>%
    select(num_range("word", 4)) %>%
    as.character() -> out
  # Back off to the trigram match when no quadgram is found
  if (out == "character(0)") trigram_func(inputWords) else out
}

Next Word Prediction Function

This function will be used to predict the next word when a word or phrase is entered.

ngrams_func <- function(wordPhraseInput){
  # Create a dataframe
  wordPhraseInput <- data.frame(text = wordPhraseInput)
  # Clean the input
  replace_reg <- "[^[:alpha:][:space:]]*"
  wordPhraseInput <- wordPhraseInput %>%
    mutate(text = str_replace_all(text, replace_reg, ""))
  # Find word count, separate words, lower case
  inputCount <- str_count(wordPhraseInput, boundary("word"))
  inputWords <- unlist(str_split(wordPhraseInput, boundary("word")))
  inputWords <- tolower(inputWords)
  # Call the matching functions
  out <- ifelse(inputCount == 1, bigram_func(inputWords), 
                ifelse(inputCount == 2, trigram_func(inputWords),
                       quadgram_func(inputWords)))
  # Output
  return(out)
}

Predict next word

Predict next word for the following words and phrases.

  1. happy
  2. my new
  3. good to see
  4. just to let you
  5. thank you so
ngrams_func("happy")
[1] "birthday"
ngrams_func("my new")
[1] "york"
ngrams_func("good to see")
[1] "you"
ngrams_func("just to let you")
[1] "are"
ngrams_func("thank you so")
[1] "much"